Spiritual Thought

D&C 6:36: “Look unto me in every thought; doubt not, fear not.”

“As an apostle of the Lord Jesus Christ, I invoke these blessings upon you, that as you look to the Savior and trust in Him, you will be blessed with hope to overcome perplexity, with spiritual settledness to cut through commotion, with ears to hear and a heart to always remember the word of the Lord, and with the discernment to see things as they really are.”

David A. Bednar - BYU Speeches, April 16th, 2021

Grading Exercises

Remember that you grade your own submitted exercises using the rubric specified in each posted exercise solution, adding comments (directly for R scripts or using the comment feature for Word documents) where your solution differs from the posted solution.

# Find the customers who have spent the most recently.
customer_data |> 
  left_join(store_transactions, join_by(customer_id)) |> 
  # -1: I forgot to filter for the South and create a new age column.
  select(age, gender:region, dec_2018) |> 
  arrange(desc(dec_2018)) |> 
  slice(1:3)

Marketing Analytics Process

Discrete Data

Remember that summarizing data is initially all about discovery, the heart of exploratory data analysis.

  • Computing statistics (i.e., numerical summaries).
  • Visualizing data (i.e., graphical summaries).

How we summarize depends on whether the data is discrete or continuous.

  • Discrete means “individually separate and distinct.”
  • Discrete data are also called qualitative or categorical.

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Can you identify any discrete variables? What are their data types in R?

customer_data <- read_csv("customer_data.csv")
## Rows: 10531 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (8): gender, married, college_degree, region, state, review_time, review...
## dbl (5): customer_id, birth_year, income, credit, star_rating
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
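
One way to approach this question is to inspect the column types directly. The sketch below is only an illustration: toy_customers is a hypothetical stand-in with made-up values, and only the column names are borrowed from the import message above.

```r
library(tidyverse)

# Hypothetical stand-in for customer_data; the values are made up.
toy_customers <- tibble(
  customer_id = c(1001, 1002, 1003),
  region = c("West", "South", "West"),
  college_degree = c("Yes", "No", "Yes"),
  income = c(52000, 61000, 48000)
)

# glimpse() lists every column along with its data type.
glimpse(toy_customers)

# Character columns are the usual candidates for discrete variables.
discrete_names <- toy_customers |>
  select(where(is.character)) |>
  names()

discrete_names
```

Note that a numeric column can still be discrete in meaning (an ID, for example), so the data type is a starting point rather than a final answer.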

Summarize Discrete Data

An important statistic for a discrete variable is a count.

customer_data |> 
  count(region)
## # A tibble: 4 × 2
##   region        n
##   <chr>     <int>
## 1 Midwest    1101
## 2 Northeast  3224
## 3 South      1111
## 4 West       5095

How would I get a count by both region and college_degree (i.e., a cross-tab)?

customer_data |> 
  count(region, college_degree)
## # A tibble: 8 × 3
##   region    college_degree     n
##   <chr>     <chr>          <int>
## 1 Midwest   No               229
## 2 Midwest   Yes              872
## 3 Northeast No               640
## 4 Northeast Yes             2584
## 5 South     No               891
## 6 South     Yes              220
## 7 West      No               989
## 8 West      Yes             4106
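
The counts above are in long format, with one row per combination. If a wide cross-tab layout is easier to read, one possible approach is to reshape the counts with pivot_wider(). This is a minimal sketch on a hypothetical toy_customers tibble with made-up values.

```r
library(tidyverse)

# Hypothetical toy data; the values are made up.
toy_customers <- tibble(
  region = c("West", "West", "West", "South", "South", "Midwest"),
  college_degree = c("Yes", "Yes", "No", "No", "No", "Yes")
)

cross_tab <- toy_customers |>
  count(region, college_degree) |>
  # One row per region, one column per college_degree value.
  pivot_wider(
    names_from = college_degree,
    values_from = n,
    values_fill = 0 # Combinations that never occur get a count of 0.
  )

cross_tab
```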

Visualize Data

{ggplot2} provides a consistent grammar of graphics built with layers.

  1. Data – Data to visualize.
  2. Aesthetics – Mapping graphical elements to data.
  3. Geometry – Or “geom,” the graphic representing the data.
  4. Facets, Labels, Scales, etc.
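
To make the idea of layers concrete, the sketch below builds a plot one layer at a time. The toy_counts tibble is hypothetical, and storing the plot in p at each step is just for illustration.

```r
library(tidyverse)

# Hypothetical toy summary; the values are made up.
toy_counts <- tibble(
  region = c("Midwest", "South", "West"),
  n = c(2, 3, 5)
)

# 1. Data and 2. Aesthetics: which data, and how columns map to the axes.
p <- ggplot(toy_counts, aes(x = region, y = n))

# 3. Geometry: the graphic that represents the data.
p <- p + geom_col()

# 4. Labels: one of the optional layers.
p <- p + labs(x = "Region", y = "Count")

p
```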

Visualize Discrete Data

Plot our first summary. Note how + is different from |>: the pipe passes data from one function to the next, while + adds layers to a plot.

customer_data |> 
  count(region) |> 
  ggplot(aes(x = region, y = n)) +
  geom_col()

Visualize our second summary by adding the aesthetic fill = college_degree.

The position argument of geom_col() is set to "stack" by default. Try "fill" instead.

customer_data |> 
  count(region, college_degree) |> 
  ggplot(aes(x = region, y = n, fill = college_degree)) +
  geom_col()

customer_data |> 
  count(region, college_degree) |> 
  ggplot(aes(x = region, y = n, fill = college_degree)) +
  geom_col(position = "fill")
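
With position = "fill", each bar is rescaled so its segments sum to 1. A rough way to see the numbers behind that plot is to compute the within-region proportions by hand. This is a sketch on hypothetical toy_counts values, not the class data.

```r
library(tidyverse)

# Hypothetical toy counts; the values are made up.
toy_counts <- tibble(
  region = c("South", "South", "West", "West"),
  college_degree = c("No", "Yes", "No", "Yes"),
  n = c(30, 10, 20, 80)
)

toy_props <- toy_counts |>
  group_by(region) |>
  # Each count as a share of its region's total.
  mutate(prop = n / sum(n)) |>
  ungroup()

toy_props
```

Here the South proportions are 0.75 and 0.25 and the West proportions are 0.2 and 0.8, which are the segment heights a filled bar chart of these counts would show.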

Facets

Facets allow us to visualize by another discrete variable. For example, is this relationship different depending on gender?

customer_data |> 
  count(region, college_degree, gender) |> 
  ggplot(aes(x = region, y = n, fill = college_degree)) +
  geom_col(position = "fill") +
  facet_wrap(~ gender)

Labels and Scales

It’s no longer a count on the y-axis. Let’s change the labels.

customer_data |> 
  count(region, college_degree, gender) |> 
  ggplot(aes(x = region, y = n, fill = college_degree)) +
  geom_col(position = "fill") +
  facet_wrap(~ gender) +
  labs(
    title = "Proportion of Customers with College Degrees by Region and Gender",
    subtitle = "Based on 10,531 Customers in the CRM Database",
    x = "Region",
    y = "Proportion"
  )

What about the legend? And these colors?

customer_data |> 
  count(region, college_degree, gender) |> 
  ggplot(aes(x = region, y = n, fill = college_degree)) +
  geom_col(position = "fill") +
  facet_wrap(~ gender) +
  labs(
    title = "Proportion of Customers with College Degrees by Region and Gender",
    subtitle = "Based on 10,531 Customers in the CRM Database",
    x = "Region",
    y = "Proportion"
  ) +
  scale_fill_manual(
    name = "College Degree",
    values = c("royalblue", "darkblue")
  )
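
Two optional refinements, sketched on hypothetical toy_counts values: naming the color values so each color is tied to a specific category rather than to position, and showing the y-axis as percentages with label_percent() from the {scales} package (assumed to be available, since it is installed alongside {ggplot2}).

```r
library(tidyverse)

# Hypothetical toy counts; the values are made up.
toy_counts <- tibble(
  region = c("South", "South", "West", "West"),
  college_degree = c("No", "Yes", "No", "Yes"),
  n = c(30, 10, 20, 80)
)

p <- toy_counts |>
  ggplot(aes(x = region, y = n, fill = college_degree)) +
  geom_col(position = "fill") +
  # Show proportions as percentages on the y-axis.
  scale_y_continuous(labels = scales::label_percent()) +
  scale_fill_manual(
    name = "College Degree",
    # Named values tie each color to a category explicitly.
    values = c("No" = "royalblue", "Yes" = "darkblue")
  )

p
```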

Text Data

Text data is also discrete but it is unstructured.

  • Authors can express themselves freely.
  • The same idea can be expressed in many ways.

What sort of structure might we impose on text data so we can visualize it?

Tokenize Text Data

We can use unnest_tokens() to tokenize the text (i.e., split it into individual words or tokens).

review_data <- customer_data |>
  select(customer_id, review_text) |> 
  tidytext::unnest_tokens(word, review_text)

review_data
## # A tibble: 165,510 × 2
##    customer_id word        
##          <dbl> <chr>       
##  1        1001 everything's
##  2        1001 fine        
##  3        1002 <NA>        
##  4        1003 <NA>        
##  5        1004 <NA>        
##  6        1005 i           
##  7        1005 looked      
##  8        1005 all         
##  9        1005 over        
## 10        1005 the         
## # ℹ 165,500 more rows
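
To see what tokenizing does on a small scale, here is a minimal sketch with a hypothetical toy_reviews tibble. It assumes the {tidytext} package is installed. By default, unnest_tokens() converts text to lowercase and strips punctuation.

```r
library(tidyverse)

# Hypothetical toy reviews; the text is made up.
toy_reviews <- tibble(
  customer_id = c(1, 2),
  review_text = c("Great fit, GREAT price!", "Easy to use.")
)

toy_tokens <- toy_reviews |>
  # One row per word; the review_text column is replaced by word.
  tidytext::unnest_tokens(word, review_text)

toy_tokens
```

The two reviews become seven rows: great, fit, great, price for the first customer and easy, to, use for the second.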

Summarize Text Data

With the text data tokenized, we can compute counts just like other discrete data.

review_data |> 
  count(word) |> 
  arrange(desc(n))
## # A tibble: 10,178 × 2
##    word      n
##    <chr> <int>
##  1 the    7512
##  2 <NA>   7373
##  3 and    4633
##  4 i      4486
##  5 a      4176
##  6 to     3949
##  7 it     3581
##  8 for    2531
##  9 is     2419
## 10 of     2106
## # ℹ 10,168 more rows
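
As a possible shorthand, count() has a sort argument that orders the result from largest to smallest count, which can stand in for the separate arrange(desc(n)) step. A small sketch on a hypothetical toy_tokens tibble:

```r
library(tidyverse)

# Hypothetical toy tokens; the words are made up.
toy_tokens <- tibble(
  word = c("fit", "price", "fit", "easy", "fit", "price")
)

sorted_counts <- toy_tokens |>
  # sort = TRUE orders rows by descending n.
  count(word, sort = TRUE)

sorted_counts
```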

Drop Missing Data

Missing values are (and should be) encoded as NA.

review_data <- review_data |> 
  drop_na(word)

review_data
## # A tibble: 158,137 × 2
##    customer_id word        
##          <dbl> <chr>       
##  1        1001 everything's
##  2        1001 fine        
##  3        1005 i           
##  4        1005 looked      
##  5        1005 all         
##  6        1005 over        
##  7        1005 the         
##  8        1005 internet    
##  9        1005 to          
## 10        1005 find        
## # ℹ 158,127 more rows
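
For comparison, the same rows can be dropped with filter() and is.na(). The sketch below uses a hypothetical toy_tokens tibble to show both approaches side by side.

```r
library(tidyverse)

# Hypothetical toy tokens with one missing value.
toy_tokens <- tibble(
  customer_id = c(1, 2, 3),
  word = c("fine", NA, "great")
)

# drop_na() removes rows where the named column is NA.
dropped <- toy_tokens |>
  drop_na(word)

# filter() with !is.na() keeps rows where word is not missing.
filtered <- toy_tokens |>
  filter(!is.na(word))

dropped
filtered
```

Both results keep the rows for customers 1 and 3.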

Remove Stop Words

Commonly used words aren’t very informative and are referred to as stop words.

tidytext::stop_words
## # A tibble: 1,149 × 2
##    word        lexicon
##    <chr>       <chr>  
##  1 a           SMART  
##  2 a's         SMART  
##  3 able        SMART  
##  4 about       SMART  
##  5 above       SMART  
##  6 according   SMART  
##  7 accordingly SMART  
##  8 across      SMART  
##  9 actually    SMART  
## 10 after       SMART  
## # ℹ 1,139 more rows

This is just a data frame, and we know how to join data frames!

An anti join keeps only the rows of the “left” data frame that have no match in the “right” data frame, and it keeps just the left data frame’s columns. (It’s nearly the opposite of an inner join.)

review_data <- review_data |>
  anti_join(tidytext::stop_words, join_by(word))

review_data |> 
  count(word) |> 
  arrange(desc(n))
## # A tibble: 9,552 × 2
##    word        n
##    <chr>   <int>
##  1 fit       403
##  2 product   378
##  3 easy      338
##  4 quality   330
##  5 nice      324
##  6 bag       287
##  7 price     281
##  8 time      278
##  9 love      264
## 10 size      247
## # ℹ 9,542 more rows
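
To see how an anti join compares with an inner join, here is a minimal sketch on two hypothetical tibbles, toy_words and toy_stops, with made-up values.

```r
library(tidyverse)

# Hypothetical toy data; the values are made up.
toy_words <- tibble(word = c("the", "fit", "and", "price"))
toy_stops <- tibble(word = c("the", "and", "of"), lexicon = "toy")

# Anti join: rows of toy_words with no match in toy_stops.
kept <- toy_words |>
  anti_join(toy_stops, join_by(word))

# Inner join: rows of toy_words with a match, plus toy_stops columns.
matched <- toy_words |>
  inner_join(toy_stops, join_by(word))

kept
matched
```

The anti join returns fit and price with only the word column, while the inner join returns the and and together with the lexicon column.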

Visualize Word Counts

review_data |> 
  count(word) |> 
  arrange(desc(n)) |> 
  ggplot(aes(x = word, y = n)) +
  geom_col()

What can we do to make this plot readable?

Factors

Unlike a character variable, a factor can include information about order.

  • Internally, a factor stores each value as an integer code, and the order of those codes encodes order.
  • A factor’s levels are the character strings (labels) associated with each code.

review_data |> 
  count(word) |> 
  arrange(desc(n)) |> 
  slice(1:10) |> 
  mutate(word = fct_reorder(word, n)) |>
  ggplot(aes(x = n, y = word)) +
  geom_col()
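
To see what fct_reorder() changes, the sketch below compares level order before and after reordering. The toy_counts tibble is hypothetical, with made-up counts.

```r
library(tidyverse)

# Hypothetical toy counts; the values are made up.
toy_counts <- tibble(
  word = c("bag", "fit", "easy"),
  n = c(287, 403, 150)
)

# Default factor levels are alphabetical.
default_levels <- levels(factor(toy_counts$word))

# fct_reorder() orders the levels by n, smallest first.
reordered <- toy_counts |>
  mutate(word = fct_reorder(word, n))

default_levels
levels(reordered$word)
```

The default levels are bag, easy, fit, while the reordered levels are easy, bag, fit. Since {ggplot2} draws a discrete axis in level order, the largest count lands at the top of a horizontal bar chart.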

Wrapping Up

Summary

  • Computed counts, including tokenizing and counting text.
  • Practiced the basics of plotting with {ggplot2}.

Next Time

  • Summarizing continuous data with {dplyr}.
  • Visualizing continuous data with {ggplot2}.

Supplementary Material

  • R for Data Science (2e) Chapters 2 and 18

Artwork by @allison_horst

Exercise 3

In RStudio on Posit Cloud, create a new R script and do the following.

  1. Load the tidyverse.
  2. Import and explore customer_data using the functions we’ve covered.
  3. Provide at least one interesting numeric summary and one interesting visualization using discrete variables only.
  4. Practice good coding conventions: Comment often, write consecutive lines of code using the pipe (|>), and use the demonstrated style (e.g., variable names, spacing within functions).
  5. Export the R script and upload to Canvas.